The objective of this notebook is to study the COVID-19 outbreak with the help of some basic visualization techniques, to compare the spread of COVID-19 across countries of the world, and to perform predictions and time-series forecasting in order to study the impact and spread of COVID-19 in the coming days.
Dataset source: The Roche Data Science Coalition (RDSC) is requesting the collaborative effort of the AI community to fight COVID-19. This challenge presents a curated collection of datasets from 20 global sources and asks you to model solutions to key questions that were developed and evaluated by a global frontline of healthcare providers, hospitals, suppliers, and policy makers.
We read the Novel-Corona-Virus-2019 dataset, managed by Johns Hopkins University, into this notebook. The dataset holds information about the cumulative case counts of COVID-19 across the world. The dataset can be viewed and downloaded from https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data
The dataset CovCSD - COVID-19 Countries Statistical Dataset, created by me (available at https://www.kaggle.com/aestheteaman01/covcsd-covid19-countries-statistical-dataset), is loaded here. Information about the dataset can be found in its description section.
COVID-19 UNCOVER Collection of Datasets available from Kaggle.
US-Counties Covid-19 Dataset
With the risk of serious disease and death from Covid-19 rising with age, there is increasing concern for older adults, who have a higher risk of developing serious illness if they are infected. - Reports from https://www.statnews.com
The datasets mentioned under this challenge take data from Worldometer, which reports similar figures for the age-group-wise distribution of COVID-19 cases. The figures there also highlight that people with pre-existing COPD or other medical ailments have a higher risk of a severe COVID-19 infection.
We are going to analyze several datasets to understand this fact much better.
#Data analysis libraries
import pandas as pd
import numpy as np
from urllib.request import urlopen
import json
import glob
import os
#Importing Data plotting libraries
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as py
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.ticker as ticker
import matplotlib.animation as animation
%matplotlib inline
#Other miscellaneous libraries
import warnings
warnings.filterwarnings('ignore')
from IPython.display import HTML
import matplotlib.colors as mc
import colorsys
from random import randint
import re
from pathlib import Path, PureWindowsPath
# Path to the local copy of the novel-corona-virus-2019 dataset
dataset_dir = "C:\\Users\\Vejendla\\Desktop\\CIS-732\\novel-corona-virus-2019-dataset\\"
print(dataset_dir)
filenames = os.listdir(dataset_dir)
print("Number of files found in the dataset directory:", len(filenames))
#Reading the cumulative cases dataset
covid_cases = pd.read_csv(r'C:\Users\Vejendla\Desktop\CIS-732\novel-corona-virus-2019-dataset\covid_19_data.csv')
#Viewing the dataset
covid_cases.head()
The following are the procedures taken into consideration.
We group the dataset country-wise; the data for the country we want to check is later fetched from the grouped dataset.
#Grouping the same cities and countries together along with their successive dates
country_list = covid_cases['Country/Region'].unique()
frames = []
for country in country_list:
    country_mask = covid_cases['Country/Region'] == country
    frames.append(covid_cases[country_mask])
country_grouped_covid = pd.concat(frames, axis=0).reset_index(drop=True)
country_grouped_covid.head()
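The per-country loop above can also be expressed as a single vectorized sort, which gives the same country-then-date ordering in one pass. A minimal sketch on a hypothetical mini-frame (the values below are illustrative, not the real data):

```python
import pandas as pd

# Hypothetical mini-frame standing in for covid_cases
df = pd.DataFrame({
    'Country/Region': ['India', 'Italy', 'India'],
    'ObservationDate': ['01/22/2020', '01/22/2020', '01/23/2020'],
    'Confirmed': [1, 2, 3],
})

# Equivalent to looping over unique countries and concatenating:
# order rows by country, then by observation date
grouped = df.sort_values(['Country/Region', 'ObservationDate']).reset_index(drop=True)
```

Sorting avoids repeated `pd.concat` calls, which copy the frame on every iteration.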
#Dropping of the column Last Update
country_grouped_covid.drop('Last Update', axis=1, inplace=True)
#Replacing NaN Values in Province/State with a string "Not Reported"
country_grouped_covid['Province/State'].replace(np.nan, "Not Reported", inplace=True)
#Printing the dataset
country_grouped_covid.head()
#country_grouped_covid holds the dataset for the country
#Creating a dataset to analyze the cases country wise - As of 05/04/2020
latest_data = country_grouped_covid['ObservationDate'] == '05/04/2020'
country_data = country_grouped_covid[latest_data]
#The total number of reported Countries
country_list = country_data['Country/Region'].unique()
print("The total number of countries with COVID-19 Confirmed cases as of 4th May 2020: {}".format(country_list.size))
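The same count can be obtained directly with pandas' `nunique`, without materializing the array of unique values. A small sketch on a toy frame (hypothetical values):

```python
import pandas as pd

# Toy frame standing in for country_data
df = pd.DataFrame({'Country/Region': ['India', 'Italy', 'India', 'US']})

# Equivalent to df['Country/Region'].unique().size
n_countries = df['Country/Region'].nunique()
print(n_countries)
```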
#Creating the interactive map
py.init_notebook_mode(connected=True)
#GroupingBy the dataset for the map
DateCountry_cdf = covid_cases.groupby(['ObservationDate', 'Country/Region'])[['Confirmed', 'Deaths', 'Recovered']].max()
DateCountry_cdf = DateCountry_cdf.reset_index()
DateCountry_cdf['Date'] = pd.to_datetime(DateCountry_cdf['ObservationDate'])
DateCountry_cdf['Date'] = DateCountry_cdf['Date'].dt.strftime('%m/%d/%Y')
DateCountry_cdf['log_ConfirmedCases'] = np.log(DateCountry_cdf.Confirmed + 1)
DateCountry_cdf['log_Fatalities'] = np.log(DateCountry_cdf.Deaths + 1)
#Plotting the figure
fig = px.choropleth(DateCountry_cdf, locations="Country/Region", locationmode='country names',
color="log_ConfirmedCases", hover_name="Country/Region",projection="mercator",
animation_frame="Date",width=1000, height=1000,
color_continuous_scale=px.colors.sequential.Viridis,
title='The Spread of COVID-19 Cases Across World')
#Showing the figure
fig.update(layout_coloraxis_showscale=True)
py.offline.iplot(fig)
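The choropleth color scale uses log(Confirmed + 1), so countries with zero reported cases still map to a finite value. NumPy's `log1p` computes the same quantity and is numerically safer for small counts; a small sketch:

```python
import numpy as np

confirmed = np.array([0, 9, 999])
log_cases = np.log(confirmed + 1)  # the transform used for the choropleth color scale

# np.log1p(x) == log(1 + x); identical here, more accurate for tiny x
same = np.allclose(log_cases, np.log1p(confirmed))
```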
#Plotting the figure for Fatalities
fig = px.choropleth(DateCountry_cdf, locations="Country/Region", locationmode='country names',
color="log_Fatalities", hover_name="Country/Region",projection="mercator",
animation_frame="Date",width=1000, height=1000,
color_continuous_scale=px.colors.sequential.Viridis,
title='The Deaths because of COVID-19 Cases')
#Showing the figure
fig.update(layout_coloraxis_showscale=True)
py.offline.iplot(fig)
#Plotting a bar graph for confirmed cases vs deaths due to COVID-19 in World using plotly
unique_dates = country_grouped_covid['ObservationDate'].unique()
confirmed_cases = []
recovered = []
deaths = []
for date in unique_dates:
    date_wise = country_grouped_covid['ObservationDate'] == date
    test_data = country_grouped_covid[date_wise]
    confirmed_cases.append(test_data['Confirmed'].sum())
    deaths.append(test_data['Deaths'].sum())
    recovered.append(test_data['Recovered'].sum())
#Converting the lists to a pandas dataframe.
country_dataset = {'Date' : unique_dates, 'Confirmed' : confirmed_cases, 'Recovered' : recovered, 'Deaths' : deaths}
country_dataset = pd.DataFrame(country_dataset)
#Plotting the Graph of Cases vs Deaths Globally.
fig = go.Figure()
fig.add_trace(go.Bar(x=country_dataset['Date'],y=country_dataset['Confirmed'], name='Confirmed Cases of COVID-19', marker_color='rgb(55, 83, 109)'))
fig.add_trace(go.Bar(x=country_dataset['Date'],y=country_dataset['Deaths'],name='Total Deaths because of COVID-19',marker_color='rgb(26, 118, 255)'))
fig.update_layout(title='Confirmed Cases and Deaths from COVID-19',xaxis_tickfont_size=14,
yaxis=dict(title='Reported Numbers',titlefont_size=16,tickfont_size=14,),
legend=dict(x=0,y=1.0,bgcolor='rgba(255, 255, 255, 0)',bordercolor='rgba(255, 255, 255, 0)'),barmode='group',bargap=0.15, bargroupgap=0.1)
fig.show()
fig = go.Figure()
fig.add_trace(go.Bar(x=country_dataset['Date'], y=country_dataset['Confirmed'], name='Confirmed Cases of COVID-19', marker_color='rgb(55, 83, 109)'))
fig.add_trace(go.Bar(x=country_dataset['Date'],y=country_dataset['Recovered'],name='Total Recoveries because of COVID-19',marker_color='rgb(26, 118, 255)'))
fig.update_layout(title='Confirmed Cases and Recoveries from COVID-19',xaxis_tickfont_size=14,
yaxis=dict(title='Reported Numbers',titlefont_size=16,tickfont_size=14,),
legend=dict(x=0,y=1.0,bgcolor='rgba(255, 255, 255, 0)',bordercolor='rgba(255, 255, 255, 0)'),
barmode='group',bargap=0.15, bargroupgap=0.1)
fig.show()
On March 17th 2020, 56 days after the first confirmed case of COVID-19, the global count of confirmed COVID-19 cases crossed the 200k mark.
Within 7 days, on March 24th 2020, the global confirmed case count passed the 400k mark.
It took 3 days, from March 24th 2020 to March 27th 2020, for the global confirmed case count to reach the 600k mark.
A similar trend continued, and on April 2, 2020 the 1 million mark was crossed.
Within the next 2 days, 200k more confirmed cases were added.
The total number of recovered cases was far lower than the confirmed cases: as of April 6th 2020, 20.55% of total confirmed cases had recovered.
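The milestone timings above can be recomputed mechanically from a cumulative series. A sketch on hypothetical daily totals (the numbers and dates below are illustrative stand-ins for `country_dataset['Confirmed']`, not the real counts):

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative global totals, one value per day
confirmed = pd.Series([50_000, 120_000, 210_000, 420_000, 650_000, 1_050_000])
dates = pd.date_range('2020-03-12', periods=len(confirmed))

def first_day_crossing(threshold):
    """Return the first date on which cumulative confirmed cases reach threshold.
    Assumes the threshold is actually crossed within the series."""
    idx = int(np.argmax(confirmed.values >= threshold))
    return dates[idx]

print(first_day_crossing(200_000).date())
```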
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
init_notebook_mode(connected=True)
from pathlib import Path, PureWindowsPath
# Path to the local copy of the UNCOVER our_world_in_data files
uncover_dir = "C:\\Users\\Vejendla\\Desktop\\CIS-732\\uncover\\UNCOVER\\our_world_in_data\\"
print(uncover_dir)
filenames = os.listdir(uncover_dir)
print("Number of files found in the directory:", len(filenames))
tests_by_country = pd.read_csv(uncover_dir + 'total-covid-19-tests-performed-by-country.csv')
tests_per_million = pd.read_csv(uncover_dir + 'total-covid-19-tests-performed-per-million-people.csv')
tests_vs_confirmed = pd.read_csv(uncover_dir + 'tests-conducted-vs-total-confirmed-cases-of-covid-19.csv')
testspermillion_vs_confirmed = pd.read_csv(uncover_dir + 'per-million-people-tests-conducted-vs-total-confirmed-cases-of-covid-19.csv')
tests_by_country.head(2)
tests_per_million.head(2)
tests_vs_confirmed.head(2)
testspermillion_vs_confirmed.head(2)
tests_merged = pd.merge(tests_by_country, tests_per_million, on='entity')
tests_merged = tests_merged.drop(['code_y', 'date_y'], axis = 1)
tests_merged = tests_merged.rename(columns = {'code_x': 'code','date_x':'date'})
tests_merged.head()
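The rename/drop pass after the merge can be avoided with pandas' `suffixes` parameter, which labels the overlapping columns up front. A sketch on toy frames mirroring `tests_by_country` and `tests_per_million` (hypothetical values):

```python
import pandas as pd

# Toy stand-ins for the two test datasets
by_country = pd.DataFrame({'entity': ['Italy'], 'code': ['ITA'], 'date': ['Mar 20'],
                           'total_covid_19_tests': [200000]})
per_million = pd.DataFrame({'entity': ['Italy'], 'code': ['ITA'], 'date': ['Mar 20'],
                            'tests_per_million': [3300]})

# Overlapping columns from the right frame get a '_pm' suffix; the left keeps its names
merged = pd.merge(by_country, per_million, on='entity', suffixes=('', '_pm'))
merged = merged.drop(columns=['code_pm', 'date_pm'])
```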
#Total Tests vs Entity/Country
sorted_by_tests = tests_merged.sort_values('total_covid_19_tests')
plt.figure(figsize=(30,25))
plt.barh('entity','total_covid_19_tests', data=sorted_by_tests)
plt.xlabel("total_covid_19_tests", size=15)
plt.ylabel("entity", size=15)
plt.tick_params(axis='x', rotation = 90, labelsize = 18)
plt.tick_params(axis='y', labelsize = 18)
plt.title("total covid_19 Tests vs Entity", size=50);
#Using a choropleth for better visualization
data = dict(type = 'choropleth',
locations = tests_merged['entity'],
locationmode = 'country names',
autocolorscale = False,
colorscale = 'Rainbow',
text= tests_merged['entity'],
z=tests_merged['total_covid_19_tests'],
marker = dict(line = dict(color = 'rgb(255,255,255)',width = 1)),
colorbar = {'title':'Tests Performed','len':0.25,'lenmode':'fraction'})
layout = dict(geo = dict(scope='world'), width = 1000, height = 600)
worldmap = go.Figure(data = [data],layout = layout)
iplot(worldmap)
# Total Tests vs Date
sorted_by_date = tests_merged.sort_values('date')
plt.figure(figsize=(25,10))
plt.barh('date','total_covid_19_tests', data=sorted_by_date)
plt.xlabel("total_covid_19_tests", size=15)
plt.ylabel("date", size=15)
plt.tick_params(axis='x', labelsize = 20)
plt.tick_params(axis='y', labelsize = 20)
plt.title("total covid_19 tests vs date", size=40);
#Analyzing Confirmed Cases
grouped_by_entity = tests_vs_confirmed.groupby('entity').sum()['total_confirmed_cases_of_covid_19_cases'].sort_values(ascending=False).to_frame(name = 'Sum').reset_index()
grouped_by_entity.head()
#### Removing the entry for 'World' and all the 0 confirmed entries
grouped_by_entity = grouped_by_entity[(grouped_by_entity['entity'] != 'World')]
grouped_by_entity = grouped_by_entity[(grouped_by_entity['Sum'] != 0)]
grouped_by_entity.size
#Keeping the 25 entities with the most confirmed cases (largest at the top of the horizontal bars)
grouped_by_entity = grouped_by_entity.sort_values('Sum').tail(25)
plt.figure(figsize=(20,10))
plt.barh('entity','Sum', data=grouped_by_entity)
plt.xlabel("total_confirmed_cases_of_covid_19_cases", size=15)
plt.ylabel("entity", size=15)
plt.tick_params(axis='x', labelsize = 18)
plt.tick_params(axis='y', labelsize = 18)
plt.title("total_covid_19_confirmed_cases Vs entity", size=40);
Although confirmed cases remain relatively low in most of the world, the depiction below shows the spread and the skewness of the confirmed cases.
hospitalization = pd.read_csv(r'C:\Users\Vejendla\Desktop\CIS-732\uncover\UNCOVER\ihme\2020_03_30\Hospitalization_all_locs.csv')
hospitalization.head(2)
hospitalization.describe()
newICU = hospitalization.groupby('location').sum()['newICU_mean'].to_frame(name = 'New ICU').reset_index()
newICU.head()
#Keeping the 25 locations with the most new ICU admissions
newICU_by_location = newICU.sort_values('New ICU').tail(25)
plt.figure(figsize=(20,10))
plt.barh('location','New ICU', data=newICU_by_location)
plt.xlabel("New ICU", size=15)
plt.ylabel("location", size=15)
plt.tick_params(axis='x', labelsize = 18)
plt.tick_params(axis='y', labelsize = 18)
plt.title("New ICU vs Location", size=30);
#Deaths vs Location
death = hospitalization.groupby('location').sum()['deaths_mean'].to_frame(name = 'Deaths').reset_index()
death.head()
#Keeping the 25 locations with the most deaths
death_by_location = death.sort_values('Deaths').tail(25)
plt.figure(figsize=(20,10))
plt.barh('location','Deaths', data=death_by_location)
plt.xlabel("Deaths", size=15)
plt.ylabel("location", size=15)
plt.tick_params(axis='x', labelsize = 18)
plt.tick_params(axis='y', labelsize = 18)
plt.title("Deaths vs Location", size=40);
A mechanical ventilator is a machine that’s used to support patients with severe respiratory conditions that impact the lungs, including pneumonia.
Before a patient is placed on a ventilator, medical staff – often anaesthetists – will perform a procedure called intubation. After a patient is sedated and given a muscle relaxant, a tube is placed through the mouth and into the windpipe.
The procedure is routine but, with Covid-19 patients, medical staff need to take extreme precautions to make sure they do not become infected with the virus. The breathing tube is then attached to the ventilator and medical staff can adjust the rate that it pushes the air and oxygen into the lungs, and adjust the oxygen mix. https://www.theguardian.com/world/2020/mar/27/how-ventilators-work-and-why-they-are-so-important-in-saving-people-with-coronavirus
df = pd.read_csv(r'C:\Users\Vejendla\Desktop\CIS-732\uncover\UNCOVER\hifld\hifld\urgent-care-facilities.csv')
df.head(2).style.background_gradient(cmap='summer')
Once a doctor sees that a patient needs a ventilator, “it is required quickly”. “The patient can be sustained for short periods of time using manual forms of ventilation such as using a bag and mask system with oxygen, but usually being attached to a ventilator needs to happen within 30 minutes if critical.”
Story says that in severe Covid-19 patients, a life-threatening condition can develop called acute respiratory distress syndrome (Ards) that requires ventilators to deliver smaller volumes of oxygen and air, but at higher rates. This could mean a patient may need to be on a ventilator “for weeks”.
To avoid complications from the breathing tube going down the throat, a tracheostomy is carried out so the tube can go straight into the windpipe through the neck. “Patients can be more awake with tracheostomy and the hole just heals itself,”
df.st_vendor.value_counts()
One of the most obvious ways to avoid a shortage of ventilators, is to reduce the numbers of people catching the disease in the first place. That means following all the health advice, including social distancing and hygiene rules.
In Australia, the Australian Healthcare and Hospitals Association, the Australia and New Zealand Intensive Care Society and the industry minister, Karen Andrews, have all expressed confidence that a shortage can be avoided. The Australian government is also investigating whether ventilators used on animals in veterinary clinics could be converted. Sleep apnoea machines and anaesthetic machines are also options.
Ventilators used in ambulances could be used as backup. All of that work will be crucial in saving lives if the social distancing measures and community lockdowns don't stem the flow of patients into critical care. "Health care workers responsible for managing severe life-threatening cases like Covid-19 are extremely concerned regarding their ability to use appropriate support for large numbers of patients expected to suffer respiratory failure.
“In essence, this means that many will not be able to be treated with mechanical ventilation and difficult decisions will have to be made by staff, families and patients about the limits of support. There are many ethical dilemmas in this, and none can be easily resolved.”
# Select facilities whose source vendor is NAVTEQ or TGS
vendors = df.loc[df.st_vendor.isin(['NAVTEQ', 'TGS'])].copy()
vendors.head(2)
#Plotting the vendor-filtered facilities
vendors.plot()
plt.style.use('dark_background')
sns.jointplot(x='naicscode', y='fips', data=df, kind='scatter')
fig=plt.gcf()
fig.set_size_inches(10,7)
fig=sns.violinplot(x='naicscode',y='fips',data=df)
plt.style.use('dark_background')
sns.set(style="darkgrid")
fig=plt.gcf()
fig.set_size_inches(10,7)
fig = sns.swarmplot(x="naicscode", y="fips", data=df)
An Italian hospital saved Covid-19 patients' lives by 3D printing valves for reanimation devices.
After the first valves were 3D printed using a filament extrusion system, on location at the hospital, more valves were later 3D printed by another local firm, Lonati SpA, using a polymer laser powder bed fusion process and a custom polyamide-based material.
The Isinnova team now developed and successfully tested a 3D printed adapter to turn a snorkeling mask into a non-invasive ventilator for COVID-19 patients. It’s an idea that anyone can 3D print using just about any type of 3D printer—and could help to address the possible shortage of hospital C-PAP masks for sub-intensive oxygen therapy, which is emerging as a concrete problem linked to the spread of COVID-19: an emergency ventilator mask, produced by adjusting a commercially available snorkeling mask.
plt.style.use('dark_background')
#sns.set(style="whitegrid")
fig=plt.gcf()
fig.set_size_inches(10,7)
ax = sns.violinplot(x="naicscode", y="fips", data=df, inner=None)
ax = sns.swarmplot(x="naicscode", y="fips", data=df,color="white", edgecolor="black")
df.plot.area(y=['naicscode','fips','zip','id'],alpha=0.4,figsize=(12, 6));
df.corr()
plt.figure(figsize=(10,4))
sns.heatmap(df.corr(),annot=True,cmap='YlOrRd_r')
plt.show()
According to Bill de Blasio, Mayor of New York City, hospitals are running short on equipment. New York City hospitals are just 10 days from running out of "really basic supplies," Mayor de Blasio said late Sunday.
“If we don’t get the equipment, we’re literally going to lose lives,” de Blasio told CNN. De Blasio has called upon the federal government to boost the city’s quickly dwindling supply of protective equipment. The city also faces a potentially deadly dearth of VENTILATORS to treat those infected by the coronavirus. Health care workers also warned of the worsening shortages, saying they were being asked to reuse and ration disposable masks and gloves.
New York City hospitals scrambled lately to accommodate a new swell of patients, dedicating new COVID-19 wings in their facilities. It remained “extremely busy” at Northwell hospitals, a spokesman said, adding their intensive care units were filling up.
"A number of hospitals have reported that they are becoming overwhelmed," said Jonah Allon, a spokesperson for Brooklyn Borough President Eric Adams.
#heatmap with Pearson Method
corr = df.corr(method='pearson')
sns.heatmap(corr)
There aren't enough ventilators to cope with the coronavirus: the United States and other countries face a critical shortage of the lifesaving machines, with no easy way to lift production.
As the United States braces for an onslaught of coronavirus cases, hospitals and governments are confronting a grim reality: There are not nearly enough lifesaving ventilator machines to go around, and there is no way to solve the problem before the disease reaches full throttle.
We use the datasets uploaded in this notebook to extract the important population parameters of each country.
#Generating a function to concatenate all of the files available.
folder_name = "C:\\Users\\Vejendla\\Desktop\\CIS-732\\covcsd-covid19-countries-statistical-dataset"
file_type = 'csv'
separator = ','
dataframe = pd.concat([pd.read_csv(f, sep=separator) for f in glob.glob(folder_name + "/*." + file_type)], ignore_index=True, sort=False)
#Selecting the columns that are required for the data-wrangling task
covid_data = dataframe[['Date', 'State', 'Country', 'Cumulative_cases', 'Cumulative_death',
'Daily_cases', 'Daily_death', 'Latitude', 'Longitude', 'Temperature',
'Min_temperature', 'Max_temperature', 'Wind_speed', 'Precipitation',
'Fog_Presence', 'Population', 'Population Density/km', 'Median_Age',
'Sex_Ratio', 'Age%_65+', 'Hospital Beds/1000', 'Available Beds/1000',
'Confirmed Cases/1000', 'Lung Patients (F)', 'Lung Patients (M)',
'Life Expectancy (M)', 'Life Expectancy (F)', 'Total_tests_conducted',
'Out_Travels (mill.)', 'In_travels(mill.)', 'Domestic_Travels (mill.)']]
covid_data.head()
#Filtering the dataset to view the latest contents (as of 30-03-2020)
latest_data = covid_data['Date'] == '30-03-2020'
country_data_detailed = covid_data[latest_data]
#Dropping unnecessary columns from the country_data_detailed dataset
country_data_detailed.drop(['Daily_cases','Daily_death','Latitude','Longitude'],axis=1,inplace=True)
#Viewing the dataset
country_data_detailed.head(3)
#Replacing the text "Not Reported" and "N/A" with NumPy missing values (NaN)
country_data_detailed.replace('Not Reported',np.nan,inplace=True)
country_data_detailed.replace('N/A',np.nan,inplace=True)
#Viewing the dataset
country_data_detailed.head(3)
#Converting the datatypes
country_data_detailed['Lung Patients (F)'].replace('Not reported',np.nan,inplace=True)
country_data_detailed['Lung Patients (F)'] = country_data_detailed['Lung Patients (F)'].astype("float")
The dataset holds information about:
#Getting the dataset to check the correlation
corr_data = country_data_detailed.drop(['Date','State','Country','Min_temperature','Max_temperature','Out_Travels (mill.)',
'In_travels(mill.)','Domestic_Travels (mill.)','Total_tests_conducted','Age%_65+'], axis=1)
#Converting the dataset to the correlation function
corr = corr_data.corr()
def heatmap(x, y, size, color):
    fig, ax = plt.subplots(figsize=(20,3))
    # Mapping from column names to integer coordinates
    x_labels = corr_data.columns
    y_labels = ['Cumulative_cases', 'Cumulative_death']
    x_to_num = {p[1]: p[0] for p in enumerate(x_labels)}
    y_to_num = {p[1]: p[0] for p in enumerate(y_labels)}
    n_colors = 256  # Use 256 colors for the diverging color palette
    palette = sns.cubehelix_palette(n_colors)  # Create the palette
    color_min, color_max = [-1, 1]  # Range of values mapped to the palette, i.e. min and max possible correlation
    def value_to_color(val):
        val_position = float((val - color_min)) / (color_max - color_min)  # position of value within the input range
        ind = int(val_position * (n_colors - 1))  # target index in the color palette
        return palette[ind]
    ax.scatter(
        x=x.map(x_to_num),
        y=y.map(y_to_num),
        s=size * 1000,
        c=color.apply(value_to_color),  # vector of square colors, mapped to the color palette
        marker='s'
    )
    # Show column labels on the axes
    ax.set_xticks([x_to_num[v] for v in x_labels])
    ax.set_xticklabels(x_labels, rotation=30, horizontalalignment='right')
    ax.set_yticks([y_to_num[v] for v in y_labels])
    ax.set_yticklabels(y_labels)
    ax.set_xticks([t + 0.5 for t in ax.get_xticks()], minor=True)
    ax.set_yticks([t + 0.5 for t in ax.get_yticks()], minor=True)
    ax.set_xlim([-0.5, max(x_to_num.values()) + 0.5])
    ax.set_ylim([-0.5, max(y_to_num.values()) + 0.5])

corr = pd.melt(corr.reset_index(), id_vars='index')
corr.columns = ['x', 'y', 'value']
heatmap(x=corr['x'], y=corr['y'], size=corr['value'].abs(), color=corr['value'])
With a weak correlation we observe the following trends:
With a rise in temperature, the confirmed cases tend to slow down (negative correlation). However, substantial proof still needs to be added here; in upcoming versions of this notebook I'll analyze the trends across all days to check the temperature effect.
Median age tends to affect the cases, so countries with a higher median age tend to report more cases.
Life expectancy also seems to affect the confirmed COVID-19 cases, with a weak correlation. The effect is more prominent in males than in females.
We will keep looking at more datasets to analyze the correlation, since the correlations obtained here are too weak.
#Reading the temperature data file
temperature_data = pd.read_csv(r'C:\Users\Vejendla\Desktop\CIS-732\covcsd-covid19-countries-statistical-dataset\temperature_data.csv')
#Viewing the dataset
temperature_data.head(2)
#Checking the dependence of Temperature on Confirmed COVID-19 Cases
unique_temp = temperature_data['Temperature'].unique()
confirmed_cases = []
deaths = []
for temp in unique_temp:
    temp_wise = temperature_data['Temperature'] == temp
    test_data = temperature_data[temp_wise]
    confirmed_cases.append(test_data['Daily_cases'].sum())
    deaths.append(test_data['Daily_death'].sum())
#Converting the lists to a pandas dataframe.
temperature_dataset = {'Temperature' : unique_temp, 'Confirmed' : confirmed_cases, 'Deaths' : deaths}
temperature_dataset = pd.DataFrame(temperature_dataset)
#Plotting a scatter plot for cases vs. Temperature
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scattergl(x = temperature_dataset['Temperature'],y = temperature_dataset['Confirmed'], mode='markers',
marker=dict(color=np.random.randn(10000),colorscale='Viridis',line_width=1)),secondary_y=False)
fig.add_trace(go.Box(x=temperature_dataset['Temperature']),secondary_y=True)
fig.update_layout(title='Daily Confirmed Cases (COVID-19) vs. Temperature (Celsius): Global Figures - January 22 - March 30 2020',
yaxis=dict(title='Reported Numbers'),xaxis=dict(title='Temperature in Celsius'))
fig.update_yaxes(title_text="BoxPlot Range ", secondary_y=True)
fig.show()
We import the dataset Weather Data for COVID-19 Data Analysis, uploaded by Davin Bonin. It contains temperature and other weather figures for the countries with confirmed COVID-19 infections, and is updated through April 14th 2020.
sample = temperature_dataset['Temperature'].sample(n=250)
test = temperature_dataset['Temperature']
from scipy.stats import ttest_ind
stat, p = ttest_ind(sample, test)
print('Statistics=%.3f, p=%.3f' % (stat, p))
Since the p-value is > 0.05, we fail to reject the null hypothesis and can conclude that the effect of temperature on COVID-19 is the same across the sample and the full population data. No statistically significant difference is present between the two datasets, so a sole effect of temperature on the spread of COVID-19 is not supported. However, the idea that COVID-19 spreads within a certain range of temperatures needs more data and statistical testing before a substantial conclusion can be drawn.
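The procedure above, comparing a random sample against the column it was drawn from, can be reproduced end-to-end on synthetic data. A minimal sketch (the distribution parameters are made up, standing in for the temperature column):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical stand-in for the temperature column
population = rng.normal(loc=20.0, scale=5.0, size=2000)
# Stand-in for temperature_dataset['Temperature'].sample(n=250)
sample = rng.choice(population, size=250, replace=False)

# Two-sample t-test: a sample drawn from the population itself
# should rarely produce a small p-value
stat, p = ttest_ind(sample, population)
print('Statistics=%.3f, p=%.3f' % (stat, p))
```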
Do certain population/health demographics affect the spread of COVID-19, or is the spread completely random? - A case study of the USA
We load the following datasets from the UNCOVER COVID-19 Challenge datasets.
#Loading US County Wise Confirmed Cases Dataset
usa_cases_tot = pd.read_csv(r'C:\Users\Vejendla\Desktop\CIS-732\covcsd-covid19-countries-statistical-dataset\us-county.csv',dtype={"fips": str})
#Viewing the data
usa_cases_tot.head()
#Getting the geo-json files
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
#Plotting the data
py.init_notebook_mode(connected=True)
usa_cases_tot['log_ConfirmedCases'] = np.log(usa_cases_tot.Confirmed + 1)
usa_cases_tot['fips'] = usa_cases_tot['fips'].astype(str).str.rjust(5,'0')
fig = px.choropleth(usa_cases_tot, geojson=counties, locations='fips', color='log_ConfirmedCases',
color_continuous_scale="Viridis",
range_color=(0, 12),
scope="usa")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
py.offline.iplot(fig)
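County FIPS identifiers are fixed-width five-digit strings, which is why the code above right-pads them with zeros before matching against the geojson ids. A minimal illustration with made-up codes:

```python
# CSV readers often parse FIPS codes as integers and drop leading zeros;
# restoring the fixed 5-digit width is required to match the geojson ids.
fips_raw = ['1001', '36061', '100']
fips = [str(f).rjust(5, '0') for f in fips_raw]
print(fips)
```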
The spread of COVID-19 is lately concentrated around the eastern coastal side of the US. New York is the major epicenter for the US, and counties near New York have a higher concentration of cases than those farther away. The area around Chicago also has a higher case density than other parts of the US.
New York City, Nassau, Suffolk, and Westchester have the highest reported COVID-19 cases. We will look further at the demographic distribution of these regions to analyze the trends at a much better scale.
To analyze this statement we turn to our generated dataset.
#Getting the dataset to check the correlation
corr_data = usa_cases_tot.drop(['fips','state','county'], axis=1)
#Converting the dataset to the correlation function
corr = corr_data.corr()
#Plotting a heatmap
def heatmap(x, y, size, color):
    fig, ax = plt.subplots(figsize=(20,10))
    # Mapping from column names to integer coordinates
    x_labels = corr_data.columns
    y_labels = corr_data.columns
    x_to_num = {p[1]: p[0] for p in enumerate(x_labels)}
    y_to_num = {p[1]: p[0] for p in enumerate(y_labels)}
    n_colors = 256  # Use 256 colors for the diverging color palette
    palette = sns.cubehelix_palette(n_colors)  # Create the palette
    color_min, color_max = [-1, 1]  # Range of values mapped to the palette, i.e. min and max possible correlation
    def value_to_color(val):
        val_position = float((val - color_min)) / (color_max - color_min)  # position of value within the input range
        ind = int(val_position * (n_colors - 1))  # target index in the color palette
        return palette[ind]
    ax.scatter(
        x=x.map(x_to_num),
        y=y.map(y_to_num),
        s=size * 1000,
        c=color.apply(value_to_color),  # vector of square colors, mapped to the color palette
        marker='s')
    # Show column labels on the axes
    ax.set_xticks([x_to_num[v] for v in x_labels])
    ax.set_xticklabels(x_labels, rotation=30, horizontalalignment='right')
    ax.set_yticks([y_to_num[v] for v in y_labels])
    ax.set_yticklabels(y_labels)
    ax.set_xticks([t + 0.5 for t in ax.get_xticks()], minor=True)
    ax.set_yticks([t + 0.5 for t in ax.get_yticks()], minor=True)
    ax.set_xlim([-0.5, max(x_to_num.values()) + 0.5])
    ax.set_ylim([-0.5, max(y_to_num.values()) + 0.5])

corr = pd.melt(corr.reset_index(), id_vars='index')
corr.columns = ['x', 'y', 'value']
heatmap(x=corr['x'], y=corr['y'], size=corr['value'].abs(), color=corr['value'])
#Plotting a scatter plot for cases vs. Traffic Volume
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scattergl(y = usa_cases_tot['Traffic Volume'],x = usa_cases_tot['Confirmed'], mode='markers',
marker=dict(color=np.random.randn(10000),colorscale='Viridis',line_width=1)),secondary_y=False)
fig.update_layout(title='Daily Confirmed Cases (COVID-19) vs. Traffic Volume : US Figures - January 22 - April 14 2020',
xaxis=dict(title='Reported Numbers'),yaxis=dict(title='Traffic Volume'))
fig.show()
sample = usa_cases_tot['Traffic Volume'].sample(n=250)
test = usa_cases_tot['Traffic Volume']
from scipy.stats import ttest_ind
stat, p = ttest_ind(sample, test)
print('Statistics=%.3f, p=%.3f' % (stat, p))
None of the figures like the smoker percentage, obesity, or diabetes in the population appear to affect the spread of COVID-19 infections in the US in general.
A certain correlation is observed between the number of confirmed cases in a county and the traffic congestion of that county (as of 2020); the correlation between the variables is 0.613053. This might be significant, as quarantine and the total isolation restricting people's movement across US counties came late in comparison to countries like India/Korea/China/Japan. Hence asymptomatic carriers of the virus may have spread it, since movement wasn't restricted and traffic congestion in those counties is high.
The p-value is high, so we cannot reject the null hypothesis in this case. However, more research is needed before this becomes an evident conclusion.
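A correlation figure like the one quoted above can be computed with scipy's `pearsonr`, which also returns a p-value. A sketch on synthetic data with a built-in correlation of roughly 0.6 (stand-in values, not the county data):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
# Synthetic stand-ins for county traffic volume and confirmed cases,
# constructed so their true correlation is about 0.6
traffic = rng.normal(size=500)
cases = 0.6 * traffic + rng.normal(scale=0.8, size=500)

r, p = pearsonr(traffic, cases)
```

A small p-value here only says the correlation is unlikely to be zero; it says nothing about causation, which is why the prose above hedges the traffic interpretation.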
Investigating the role of the age/gender of a population and its relation with the COVID-19 infection rate.
JHU CSSE COVID-19 Dataset: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data
Kaggle Dataset (30th March 2020): https://www.kaggle.com/aestheteaman01/covcsd-covid19-countries-statistical-dataset
Kaggle Kernel - Demographics & observation for pandemic escalation: https://www.kaggle.com/aestheteaman01/demographics-observation-for-pandemic-escalation
Kaggle Kernel - Ventilators Shortage: https://www.kaggle.com/mpwolke/ventilators-shortage